Chau Tran
2023-11-06
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ purrr 1.0.2
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Registering fonts with R
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(lubridate)
library(readr)
air <- read_csv("Downloads/Airplane_crashes_dataset_since_1908.csv")## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 5008 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (15): date, time, location, operator, flight_no, route, ac_type, registr...
## dbl (2): all_aboard, all_fatalities
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
As we can see, the Date column is in the format Month Day, Year
Because I will analyze on separate month, separate day, and separate year, I will extract the years from the Data column.
First, I check the data type of the ‘date’ column. The ‘date’ column is in character.
## [1] "character"
I convert Date from character to class ‘date’. I extract year from the ‘date’ column
air$date <- as.Date(air$date, format = "%B %d, %Y")
air$year <- as.numeric(format(air$date, "%Y"))
air %>% mutate( year = air$year)## # A tibble: 5,008 × 18
## date time location operator flight_no route ac_type registration cn_ln
## <date> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1908-09-17 1718 Fort My… Militar… ? Demo… Wright… ? 1
## 2 1909-09-07 ? Juvisy-… ? ? Air … Wright… SC1 ?
## 3 1912-07-12 0630 Atlanti… Militar… ? Test… Dirigi… ? ?
## 4 1913-08-06 ? Victori… Private ? ? Curtis… ? ?
## 5 1913-09-09 1830 Over th… Militar… ? ? Zeppel… ? ?
## 6 1913-10-17 1030 Near Jo… Militar… ? ? Zeppel… ? ?
## 7 1915-03-05 0100 Tienen,… Militar… ? ? Zeppel… ? ?
## 8 1915-09-03 1520 Off Cux… Militar… ? ? Zeppel… ? ?
## 9 1916-07-28 ? Near Ja… Militar… ? ? Schutt… ? ?
## 10 1916-09-24 0100 Billeri… Militar… ? ? Zeppel… ? ?
## # ℹ 4,998 more rows
## # ℹ 9 more variables: all_aboard <dbl>, passengers_aboard <chr>,
## # crew_aboard <chr>, all_fatalities <dbl>, passenger_fatalities <chr>,
## # crew_fatalities <chr>, ground <chr>, summary <chr>, year <dbl>
I used the same process to extract day from the ‘date’ column.
air$date <- as.Date(air$date, format = "%B %d, %Y")
air$day <- as.numeric(format(air$date, "%d"))
air %>% mutate( day = factor(air$day, ordered = T))## # A tibble: 5,008 × 19
## date time location operator flight_no route ac_type registration cn_ln
## <date> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1908-09-17 1718 Fort My… Militar… ? Demo… Wright… ? 1
## 2 1909-09-07 ? Juvisy-… ? ? Air … Wright… SC1 ?
## 3 1912-07-12 0630 Atlanti… Militar… ? Test… Dirigi… ? ?
## 4 1913-08-06 ? Victori… Private ? ? Curtis… ? ?
## 5 1913-09-09 1830 Over th… Militar… ? ? Zeppel… ? ?
## 6 1913-10-17 1030 Near Jo… Militar… ? ? Zeppel… ? ?
## 7 1915-03-05 0100 Tienen,… Militar… ? ? Zeppel… ? ?
## 8 1915-09-03 1520 Off Cux… Militar… ? ? Zeppel… ? ?
## 9 1916-07-28 ? Near Ja… Militar… ? ? Schutt… ? ?
## 10 1916-09-24 0100 Billeri… Militar… ? ? Zeppel… ? ?
## # ℹ 4,998 more rows
## # ℹ 10 more variables: all_aboard <dbl>, passengers_aboard <chr>,
## # crew_aboard <chr>, all_fatalities <dbl>, passenger_fatalities <chr>,
## # crew_fatalities <chr>, ground <chr>, summary <chr>, year <dbl>, day <ord>
I use the same process to extract month from the ‘date’ column.
air$date <- as.Date(air$date, format = "%B %d, %Y")
air$month <- as.numeric(format(air$date, "%m"))
air %>% mutate( month = factor(air$month, ordered = T))## # A tibble: 5,008 × 20
## date time location operator flight_no route ac_type registration cn_ln
## <date> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1908-09-17 1718 Fort My… Militar… ? Demo… Wright… ? 1
## 2 1909-09-07 ? Juvisy-… ? ? Air … Wright… SC1 ?
## 3 1912-07-12 0630 Atlanti… Militar… ? Test… Dirigi… ? ?
## 4 1913-08-06 ? Victori… Private ? ? Curtis… ? ?
## 5 1913-09-09 1830 Over th… Militar… ? ? Zeppel… ? ?
## 6 1913-10-17 1030 Near Jo… Militar… ? ? Zeppel… ? ?
## 7 1915-03-05 0100 Tienen,… Militar… ? ? Zeppel… ? ?
## 8 1915-09-03 1520 Off Cux… Militar… ? ? Zeppel… ? ?
## 9 1916-07-28 ? Near Ja… Militar… ? ? Schutt… ? ?
## 10 1916-09-24 0100 Billeri… Militar… ? ? Zeppel… ? ?
## # ℹ 4,998 more rows
## # ℹ 11 more variables: all_aboard <dbl>, passengers_aboard <chr>,
## # crew_aboard <chr>, all_fatalities <dbl>, passenger_fatalities <chr>,
## # crew_fatalities <chr>, ground <chr>, summary <chr>, year <dbl>, day <dbl>,
## # month <ord>
The missing values in the time column accounts for 30% of the data in the column.
The missing values in the flight column accounts for 74% of the data in the column.
The missing values in the registration column accounts for 5% of the data in the column.
The missing values in the cn/ln column accounts for 13% of the data in the column.
The time column is not doable in my ability
I decide to drop those 5 columns.
The ‘location’ column is in the format city/region/country. I decide to split the values in the ‘location’ column so I can perform analysis on individual areas.
First, I use the str_split_fixed() function to split the column. For example: Victoria, British, Canada becomes |Victoria| and |British, Canada|.
Second, I use the gsub() function to erase the part before the comma in the second column that I just split. For example: British, Canada becomes Canada. Finally, I add that column to my data set and name it Country.
When I look at the number of fatalities and the number of passenger aboard, I think that I could obtain the number of survivals by subtracting the number of fatalities from number of passengers aboard. Therefore, I subtract the values in the Fatalities column from the values in the Aboard column and then add those new values to my data set and name it Survival.
I like to get the ratio of the crew fatalities by crew fatalities/ crew aboard
## [1] "character"
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
I like to get the ratio of the passenger fatalities by passenger fatalities/ passenger aboard
## [1] "character"
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
DATA DESCRIPTION
Date: Date of accident, in the format - January 01, 2001
Time: Local time, in 24 hr. format unless otherwise specified
Operator: Airline or operator of the aircraft
Flight #: Flight number assigned by the aircraft operator
Route: Complete or partial route flown prior to the accident
AC Type: Aircraft type
Reg: ICAO registration of the aircraft
cn / ln: Construction or serial number / Line or fuselage number
Aboard: Total aboard (passengers / crew)
Passengers aboard : Passengers abroad
Crew aboard : Crew abroad
All fatalities : Total fatalities aboard (passengers / crew)
Passenger fatalities: Total Passenger fatalities
Crew fatalities: Total Crew fatalities
Ground: Total killed on the ground
Summary: Brief description of the accident and cause if known
Domain question:
How has our chance to survive an airplane crash or not get into a crash evolved over 113 years (1908-2021) ?
Other questions:
What are various factors contributing to airplane crashes over 113 years?
How large is our chance to survive depending on our roles on the plane ?
How much damage will be caused when an airplane crash and to what extent ? (number of death )
Is there a safe month or day to fly ?
What type of airplane is the most dangerous to fly ?
What operator is the most dangerous to fly ?
What are the most dangerous countries to fly ?
QUANTITATIVE VARIABLE (NUMERICAL)
I will have a quick comment on some of the metrics in the data set.
As we can see, the maximum and minimum values of the year column are 1908 and 2021. The crashes in our data are stated to range from 1908 and 2021 in the metadata. Therefore, the maximum and minimum values in the year column are valid.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1908 1951 1970 1971 1992 2021
Next, the third Quartile of the ground column is 0. Therefore, we know that there are at least 75% of the crashes that did not kill any people on the ground. The mean, however, is 2 people. The maximum value of the ground column is 2750 people. That indicates there is a crash that killed 2750 people on the ground. This extreme outlier is why the mean is greater than the median and even greater than the third Quartile. Overall, it is bad to know such a big number of people killed, but now we could know that most of the time from 1908 to 2021, when a plane crashes, and we were not in the plane, we were very safe!
## Warning: NAs introduced by coercion
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 1.718 0.000 2750.000 44
Now, let’s look at the statistics of number of passengers aboard and number of fatalities together! The minimum, median, maximum of the number of passengers aboard are 0, 16, 644 respectively. The minimum, median, maximum of the number of fatalities are 0, 11, 583 respectively. We can sort of see that the number of fatalities increases as the number of passenger aboard increases.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 7.00 16.00 31.12 35.00 644.00 17
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 4.00 11.00 22.29 25.00 583.00 8
NUMBER OF PASSENGER ON BOARD AND NUMBER OF FATALITIES BY YEARS
Looking at the graph, we can see a trend that the number of fatalities (red line) increases as the number of passengers (blue line) increases. There seems to be a big difference between the 2 lines in year 1999. That difference is the number of survivors.
We can also notice that there is a slight pattern of the gap between the red line and the green line, we have many big gaps, many tiny to no gaps, and some moderate gaps. This might indicate a nature of the relationship between the number of survival and the number of fatalities.
Let’s look at the next graph to investigate more about this!
AES:
red: FATALITIES
green: Passengers
air = air[!is.na(air$all_aboard), ]
air = air[!is.na(air$all_fatalities), ]
view(air)
a = air %>% group_by(year) %>% summarise( Fa = sum(all_fatalities))
b = air %>% group_by(year) %>% summarise( A = sum(all_aboard))
c = inner_join(a,b)## Joining with `by = join_by(year)`
Based on the graph, the number of fatalities is almost always very higher than the number of survivals as there is a big gap between the red line and the green line frequently. From 1908 to 1998, there are always a lot more people killed than survived. Only in year 1999, the number of survivors was higher than the number of fatalities a little bit. After year 1999, however, the number of fatalities went back to being higher than the number of survivals again from 2000 to 2021. Therefore, I guess that there is not any odd or long term factors in year 1999 that made the more people survived because the change is just a bit.
AES:
red: FATALITIES
green: SURVIVAL
## Warning: Unknown or uninitialised column: `Fatalities`.
## [1] "NULL"
a = air %>% group_by(year) %>% summarise( Fatalities = sum(all_fatalities))
b = air %>% group_by(year) %>% summarise( Survivors = sum(Survival))
c = inner_join(a,b)## Joining with `by = join_by(year)`
As we can see, year 1999 just happens to have 3 big flights on date 7/23, 8/22, and 12/21 and the number of survivors on those three flights are 516, 312, and 298 respectively, which are large numbers. The crash on 7/23 also has the highest number of survivors out of all years.
In summary, year 1999 happens to have 3 big flights in which almost all of the passengers survived, bringing up the total number of survivors in that year, making it bigger than the number of fatalities in the entire year although year 1999 had many others flights where almost all passengers died.
After analysis on both graphs, we can see that whenever a plane crashes, there has been always tiny chance to survive from 1908 to 2021. There was not any stable factor that bring up our chance to survive. Year 1999 was just random.
AES:
red: FATALITIES
green: SURVIVAL
a = air %>% filter(year == 1999) %>% group_by(date) %>%summarise( S = sum(Survival))
b = air %>% filter(year == 1999) %>% group_by(date) %>%summarise( Fa = sum(all_fatalities))
c = inner_join(a,b)## Joining with `by = join_by(date)`
Let’s move on to our question about the nature of the relationship between the number of survival and the number of fatalities. Looking at the scatter plot, we can see that there is no point in the lower triangle, meaning that the number of fatalities is always less than or equal to the number of passengers on board. Because the number of passengers killed on board by the crashes cannot be higher than the number of passengers on board, our data is valid.
Next, we can see that most of the points concentrate on the 2 sides of the triangle and the rest lies between. The points on the skewed line are the crashes in which the number of fatalities and the number of passengers aboard are equal while the points that lie on the line x = 0 are the crashes in which the numbers of fatalities are 0.
Therefore, most of the time, when a plane crashed, and there were survivors, it was likely that almost all of the passengers survived. On the other hand, if there were fatalities, almost all of the passengers did not survive. As for the rest, we might have some parts survived and some parts dead.
We can see the same pattern here. The numbers of survivors are 0 or the numbers of fatalities are 0.
Based on the graph, the number of crashes increased from 1908 to around 1944 and then stay in that range to 2000.
The number of crashes in year 1946 was the highest.
After that, we can see a trend that the number of crashes starts to decrease from year 1994 to year 2021. As we might have known, the time period 1944-2000 is the time frame of the Cold War. A lot of planes were utilized during the war and a lot of them were shot.
AES: title: NUMBER OF CRASHES BY YEAR
Looking at the bar graph, the Aeroflot and Military - U.S. Air Force are two operators that have the most crashes
air %>% group_by(operator) %>% summarise( accident = n()) %>% arrange(-accident) %>% head(10) %>%
ggplot() + aes(x = operator, y = accident, fill = operator) + geom_col() + coord_flip()Based on the bar graph, Douglas DC-3 has the most crashes.
air %>% group_by(ac_type) %>% summarise( accident = n()) %>% arrange(-accident) %>% head(10) %>%
ggplot() + aes(x = ac_type, y = accident, fill = ac_type) + geom_col() + coord_flip()As we can see, the number of passenger killed aboard is almost always greater than the number of people killed on the grounds except for year 2001 in which the number of passenger killed on the ground is much greater than the number of people dead aboard.
AES: add the title “THE NUMBER OF FATALITIES ON BOARD AND THE NUMBER OF FATALITIES ON GROUND”
red: FATALITIES ON BOARD black: FATALITIES ON GROUND
air = air[!is.na(air$all_fatalities), ]
air = air[!is.na(air$ground), ]
a = air %>% group_by(year) %>% summarise( Fatalities = sum(all_fatalities))
b = air %>% group_by(year) %>% summarise( G = sum(ground))
c = inner_join(a,b)## Joining with `by = join_by(year)`
As we can see, the number of crashes is distributed quite equally among the months. Therefore, no matter what month we choose to fly, there is barely a safe month.
air = air[!is.na(air$month), ]
air %>% group_by(month) %>% summarise( accident = n()) %>%
ggplot() + aes(x = month, y = accident, fill = month) + geom_col()As we can see, the number of crashes is distributed almost equally among the days. Therefore, no matter what day we choose to fly, there is barely a safe day.
air = air[!is.na(air$day), ]
air %>% group_by(day) %>% summarise( accident = n()) %>%
ggplot() + aes(x = day, y = accident, fill = day) + geom_col() As we can see, the people who live in Russia, the United States, and Brazil tend to get into an airplane crash.
air = air[-which(air$Country == ""), ]
air %>% group_by(Country) %>% summarise( crashes = n()) %>% arrange(-crashes) %>% head(10) %>%
ggplot() + aes( x = Country, y = crashes, fill = Country) + geom_col() As we can see, whether your role was a crew member or a passenger, your survival chance remains pretty much the same.
AES:
title: Crew Fatalities and Passenger Fatalities
orange: Crew Fatalities Ratio
purple: Passenger Fatalities Ratio
air = air[!is.na(air$`Crew Fatalities Ratio`), ]
air = air[!is.na(air$`Passenger Fatalities Ratio`), ]
a = air %>% group_by(year) %>% summarise( Crew = sum(`Crew Fatalities Ratio`))
b = air %>% group_by(year) %>% summarise( Passenger = sum(`Passenger Fatalities Ratio`))
c = inner_join(a,b)## Joining with `by = join_by(year)`
#colors = c("Crew" = "orange", "Passenger" = "purple")
#p1 = c %>% ggplot( aes( x= year))
#p1 = p1 + geom_line(aes(y = Crew), color = "Crew")
#p1 = p1 + geom_line(aes(y = Passenger), color = "Passenger")
#p1 + scale_color_manual(values = colors)
p1 = c %>% ggplot( aes( x= year))
p1 = p1 + geom_line(aes(y = Crew), color = "orange")
p1 = p1 + geom_line(aes(y = Passenger), color = "purple")
p1